In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
In this section we will be analyzing some financial data. pandas gives us access to some data through pandas.io.data.
This is pandas remote data access: http://pandas.pydata.org/pandas-docs/stable/remote_data.html
Functions from pandas.io.data and pandas.io.ga extract data from various Internet sources into a DataFrame. The page linked above lists the sources that are currently supported.
I've seen this list actively change, so it's a good idea to check what is available to you. There are likely some really cool plugins that will continue to be built. So let's grab some stocks from Yahoo data with pandas.io.data.
In [2]:
import pandas.io.data
In [3]:
?pandas.io.data # <tab>
There's a lot of volatility in oil right now, and it's been rough for producers, to say the least. So let's check out some stocks that are involved in that market.
First we'll set start and end dates; these are just datetimes. I can build them with Python's datetime module, but I can also have pandas parse a string into a datetime, which ends up being pretty handy.
In [7]:
import datetime
print(datetime.datetime(2010,1,1))
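To make the "either way works" point concrete, here is a small sketch comparing the two approaches. The variable names are just mine for illustration; the only calls used are pd.to_datetime and datetime.datetime.

```python
import datetime

import pandas as pd

# Parse a date from a string with pandas...
from_string = pd.to_datetime("2010-1-1")

# ...or build the same date by hand with the standard library.
from_datetime = datetime.datetime(2010, 1, 1)

# The pandas result is a Timestamp, which is a subclass of datetime.datetime,
# so the two compare equal and can be mixed as start/end arguments.
same_moment = from_string == from_datetime
is_datetime = isinstance(from_string, datetime.datetime)
print(from_string, same_moment, is_datetime)
```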
The first stock is WTI, which is W&T Offshore Inc. They drill in the Gulf of Mexico.
Let's also check out:
CHK, or Chesapeake Energy Corporation,
TSLA, or Tesla Motors,
and finally CBAK, which is China Bak Battery Incorporated.
In [8]:
start = pd.to_datetime('2010-1-1')
end = datetime.datetime(2015,1,1)
ticker_symbols = ['WTI','CHK','TSLA','CBAK']
In [9]:
wti = pd.io.data.get_data_yahoo(ticker_symbols[0],start=start,end=end)
In [10]:
wti.head()
Out[10]:
Now we can get these one by one in a for loop...
In [12]:
for symbol in ticker_symbols:
    print(symbol)
    # df is overwritten on every pass, so only the last symbol's data is kept
    df = pd.io.data.get_data_yahoo(symbol,start=start,end=end)
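If you want to hang on to every frame from a loop like this, one option is to collect them in a dict keyed by symbol. This is only a sketch: pandas.io.data needs a network connection (and has since moved out of pandas into the separate pandas-datareader package), so I'm using a hypothetical fake_fetch function that returns made-up prices in place of get_data_yahoo.

```python
import numpy as np
import pandas as pd


def fake_fetch(symbol, start, end):
    """Hypothetical stand-in for get_data_yahoo: returns made-up prices."""
    dates = pd.date_range(start, end, freq="B")  # business days only
    rng = np.random.default_rng(len(symbol))     # arbitrary seed, synthetic data
    return pd.DataFrame({"Close": rng.random(len(dates))}, index=dates)


ticker_symbols = ["WTI", "CHK", "TSLA", "CBAK"]

# Keep one DataFrame per symbol instead of overwriting a single variable.
frames = {}
for symbol in ticker_symbols:
    frames[symbol] = fake_fetch(symbol, "2010-01-01", "2010-01-15")

print(sorted(frames))
```

After the loop, frames['CHK'] gives you back the frame for that one symbol.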
Or we can just get them all at once by passing in the list.
In [13]:
panl = pd.io.data.get_data_yahoo(ticker_symbols,start=start,end=end)
But we get something different back that we haven't encountered yet: a panel. Panels are advanced, and explaining their use cases is outside the scope of this video. However, I'll give you the basics.
Panels are three-dimensional containers that we can query along each of those dimensions.
In [14]:
panl
Out[14]:
We can see they've got an items axis, a major axis and a minor axis. Panels are a core part of pandas, but they are used much less than the other containers and are therefore a bit neglected as of now. That's not just me trying to avoid the topic, which would be my suspicion if I heard that. It's stated almost verbatim in the docs:
http://pandas-docs.github.io/pandas-docs-travis/dsintro.html#from-dataframe-using-to-panel-method
In [15]:
type(panl)
Out[15]:
However, let's touch on the basics because you may come across them.
We've got a lot of the basic attributes, like shape.
In [16]:
panl.shape
Out[16]:
We've got these three axes, and querying data along them works a bit differently.
Since we know the axis values, we can query them.
Items are accessed like standard DataFrame columns, with dot syntax.
In [17]:
panl.Open.head()
Out[17]:
The major and minor axes are queried differently, with the major_xs and minor_xs methods.
In [18]:
panl.major_xs('2013-5-1')
Out[18]:
In [19]:
panl.minor_xs('CHK').head()
Out[19]:
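Panel was deprecated and later removed from pandas, so the cells above will not run on a recent install. Here is a minimal sketch of the same three kinds of lookups, assuming we hold the data as a DataFrame whose columns are a two-level (field, ticker) MultiIndex. The numbers are synthetic and the names wide, field and ticker are my own, not something that comes back from the download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices (made-up numbers, not market data).
dates = pd.date_range("2013-04-29", periods=5, freq="D")
fields = ["Open", "Close"]
tickers = ["WTI", "CHK", "TSLA", "CBAK"]
columns = pd.MultiIndex.from_product([fields, tickers], names=["field", "ticker"])
values = np.arange(len(dates) * len(columns), dtype=float)
wide = pd.DataFrame(values.reshape(len(dates), len(columns)),
                    index=dates, columns=columns)

# Like panl.Open: one field, every ticker, every date.
opens = wide["Open"]

# Like panl.major_xs('2013-5-1'): one date, tickers down the rows, fields across.
one_day = wide.loc[pd.Timestamp("2013-05-01")].unstack(level="field")

# Like panl.minor_xs('CHK'): one ticker, every field, every date.
chk = wide.xs("CHK", axis=1, level="ticker")

print(opens.shape, one_day.shape, chk.shape)
```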
Some summary statistics, like mean, are available to us on the panel.
In [20]:
panl.mean()
Out[20]:
We can perform different kinds of selections and transpositions using the major and minor axes, but as I said above I'm not going to cover that material. Instead, I'm going to convert this panel to a DataFrame to show you how that's done, and in the process we're going to cover a new topic!
We convert the panel to a DataFrame with the to_frame method.
In [21]:
df = panl.to_frame()
df.head()
Out[21]:
When we use the head method to see the first 5 rows, we see things look a little different, and that's because we now have multiple indices, also called a hierarchical index or MultiIndex. Hierarchical indexes are extremely powerful but they're beyond the scope of this current video. I'll touch on them a bit later in this section.
What you need to know right now is that there are levels that are stacked on one another and those can be queried.
In [22]:
df.index.levels
Out[22]:
In [23]:
print(len(df.index.levels))
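To show what "querying a level" can look like, here is a small sketch on a hand-built frame with the same two-level shape. The data is made up, and the level names Date and minor are my assumption about what to_frame produces here, so check df.index.names on your own frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for panl.to_frame() (made-up numbers, not market data).
dates = pd.date_range("2013-05-01", periods=3, freq="D")
tickers = ["CBAK", "CHK", "TSLA", "WTI"]
index = pd.MultiIndex.from_product([dates, tickers], names=["Date", "minor"])
long_df = pd.DataFrame(
    {"Open": np.arange(12, dtype=float),
     "Close": np.arange(12, dtype=float) + 0.5},
    index=index,
)

# How many levels are stacked in the index, and what they are called.
n_levels = long_df.index.nlevels
level_names = list(long_df.index.names)

# Query the inner level: every row for one ticker, across all dates.
chk_rows = long_df.xs("CHK", level="minor")

# Query both levels at once with a (date, ticker) tuple.
one_cell = long_df.loc[(pd.Timestamp("2013-05-02"), "CHK"), "Open"]

print(n_levels, level_names, one_cell)
```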
Alright, we have our dataset in a hierarchical index, but that's not what we want to work with right now. This will likely come up when you're analyzing data: you'll want to completely reset your index. Well, have no fear; we can do that with the reset_index method.
You may find yourself using this often just to get back to square one and start over when performing analysis.
In [24]:
df.reset_index()
Out[24]:
In [25]:
df.reset_index(inplace=True)
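The difference between those two calls is easy to miss, so here is a tiny sketch on a made-up frame (df_demo is a hypothetical name, not the data from above). Without inplace, reset_index hands back a new frame and leaves the original alone; with inplace=True, it changes the original and returns None.

```python
import pandas as pd

# Small made-up frame with a two-level index.
index = pd.MultiIndex.from_product(
    [pd.to_datetime(["2013-05-01", "2013-05-02"]), ["CHK", "WTI"]],
    names=["Date", "minor"],
)
df_demo = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Returns a new frame; df_demo still has its two index levels.
flat = df_demo.reset_index()
levels_before = df_demo.index.nlevels

# Modifies df_demo itself; the old index levels become ordinary columns.
result = df_demo.reset_index(inplace=True)
levels_after = df_demo.index.nlevels

print(list(flat.columns), levels_before, levels_after, result)
```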
Now we've reset our index. I don't want to keep working with this data set, as I'm not an expert on financial data. However, we'll be working with a really cool data set in our next video: an airplane data set that has flights across the country. This is going to give us the opportunity to work on a ton of cool problems.
In [26]:
df
Out[26]: